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ABSTRACT 

Large data- sets defined on the sphere arise in many fields. In particular, recent and forthcoming observations of the anisotropics of the 
cosmic microwave background (CMB) made on the celestial sphere contain approximately three and fifty mega-pixels respectively. 
The compression of such data is therefore becoming increasingly important. We develop algorithms to compress data defined on the 
sphere. A Haar wavelet transform on the sphere is used as an energy compression stage to reduce the entropy of the data, followed 
by Huffman and run-length encoding stages. Lossless and lossy compression algorithms are developed. We evaluate compression 
performance on simulated CMB data, Earth topography data and environmental illumination maps used in computer graphics. The 
CMB data can be compressed to approximately 40% of its original size for essentially no loss to the cosmological information content 
of the data, and to approximately 20% if a small cosmological information loss is tolerated. For the topographic and illumination data 
compression ratios of approximately 40: 1 can be achieved when a small degradation in quality is allowed. We make our SZIP program 
that implements these compression algorithms available publicly. 

Key words, methods: numerical - cosmology: cosmic background radiation 



1. Introduction 

Large data- sets that are measured or defined inherently on the 
sphere arise in a range of applications. Examples include envi- 
ronmental illumination maps and reflectance functions used in 
computer graphics (e.g. Ramamoorthi & Hanrahan 2004), as- 
tronomical observations made on the celestial sphere, such as 



the cosmic microwave background (CMB) (e.g. |Bennett et al. 



|1996||2003| ), and applications in many other fields, such as plan- 
etary science (e.g. Wieczorek 2006; Wieczorek & Phillips 1998; 
Turcotte et al.|1981|), geoph ysics (g.gT [Whaler|1994[|Swenson &| 



Wahr 2002 



Choi et al. 



Sim ons et al.||2006) and quan tum chemistry (e.g. 
1999[ |Ritchie & Kemp||1^99] ). Technological ad- 



vances in observational instrumentation and improvements in 
computing power are resulting in significant increases in the 
size of data-sets defined on the sphere (hereafter we refer to 
a data-set defined on the sphere as a data-sphere). For exam- 
ple, current and forthcoming observations of the anisotropies of 
the CMB are of considerable size. Recent observations made by 
the Wilkinson Microwave Anisotropy Probe (WMAP) satellite 
(Ben nett et al.| [1996 , 2003 ) contain approxi mately three mega- 
pixels, while the forthcoming Planck mission (Planck collabora^] 



tion 



2005 ) will generate data-spheres with approximately fifty 



mega-pixels. Furthermore, cosmological analyses of these data 
often require the use of Monte Carlo simulations, which generate 
in the order of a thousand-fold increase in data size. The efficient 
and accurate compression of data-spheres is therefore becoming 
increasingly important for both the dissemination and storage of 
data. 

In general, data compression algorithms usually consist of 
an energy compression stage (often a transform or filtering pro- 



cess), followed by quantisation and entropy encoding stages. For 
example, JPEG (ISO/IEC IS 10918-1) uses a discrete cosine 
transform for the energy compression stage, whereas JPEG2000 
(ISO/IEC 15444-1:2004) uses a discrete wavelet transform. 
Due to the simultaneous localisation of signal content in scale 
and space afforded by a wavelet transform, one would expect 
wavelet-based energy compression to perform well relative to 
other methods. Wavelet theory in Euclidean space is well es- 
tablished (see Daubechies (1992) for a detailed introduction), 
however the same cannot yet be said for wavelet theory on the 
sphere. A number of attempts have been made to extend wavelets 
to the sphere. Discrete second generation wavelets on the sphere 
that are based on a multiresolution analysis have been developed 
( [Schroder & Sweldens| [T995| |Sweldens| [[996 ). Haar wavelets 
on the sphere for particular pixelisation schemes have also been 
developed ( Tenorio et al. 1999 ; Barrei ro et al. 2000 ). These dis- 
crete constructions allow for the exact reconstruction of a signal 
from its wavelet coe fficients bu t they may not necessarily lead 
to a stable basis (see |Sweldens] ( |1997| and references therein). 
Other authors have focused on continuous wavelet methodolo- 



gies on the sphere ([Freeden & Win dheuser 1997; Freede n et al. 
[T997| [HoTschneider l |1996[ |Torresani| |1995[ |Dahlke & Maass 
M9961 Antoine & V andergheynst | |1998| |1999| [Antoine et al. 
[20021 1 2004[ jPemanet & Vandergheynst| |2003| |Wiaux et al. 

|2006| ). Although sig- 



2005, Sanz et al. 2006, McEwen et al. 



nals can be reconstructed exactly from their wavelet coefficients 
in these continuous methodologies in theory, the absence of an 
infinite range of dilations precludes exact reconstruction in prac- 
tice. Approximate reconstruction formula may be developed by 
building discrete wavelet frames that are based on the continu- 
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ous methodology (e.g. Bogdanova et al. 2005 ). More recently, 
filter bank wavelet methodologies that are essentially based on 
a continuous wavelet framework have been developed for the 
axi- sy mmetric ( Starck etaLj 2006 ) and directional ( Wiau x et al. 



2008 ) cases. These methodologies allow the exact reconstruc 
tion of a signal from its wavelet coefficients in theory and in 
practice. Compression applications require a wavelet transform 
on the sphere that allows exact reconstruction, thus the method- 
ologies of|Schroder & Sweldensl([T995l ), |Tenorio et al.|il999), 
IBarreiro et al.| ( |2000| ), |Starck et al.lpOOSf and |Wiauxet al.|(2008) 
are candidates. 

Data compression algorithms on the sphere that use 
wavelet or alternative transforms have been developed already. 
Compression on the sphere was considered first, to our knowl- 
edge, in the pioneering work of |Schroder & Swel dens| ( |1995| ). 
The lifting scheme was used here to define a discrete wavelet 
transform on the sphere, however compression was analysed 
only in terms of the number of wavelet coefficients required to 
represent a data-sphere in a lossy manner and no encoding stage 
was performed. The addi tion of an encoding sta ge to this algo- 
rithm was performed by Kolarov & Lynch| (l997 ) using zero-tree 
coding methods. An alternative compression algorithm ba sed on 
a Faber decomposition has been proposed by Assaf (1999 ), how- 
ever no encoding stage is included and performance is again 
analysed only in terms of the number of coefficients required 
to recover a lossy representation of the data-sphere. The data- 
sphere compr ession algorithm devised by [Schroder & Swel dens| 
(1995]) and Kolarov & Lynch| ( 1997 ) therefore constitutes the 
current state-of-the-art. This algorithm relies on an icosahedron 
pixelisation of the sphere that is based on triangular subdivi- 
sions. The corresponding pixelisation of the sphere precludes 
pixel centres located on rings of constant latitude. Constant lat- 
itude pixelisations of the sphere are of considerable practical 
use since this property allows the development of many fast al- 
gorithms on pixelised spheres, such as fast spherical harmonic 
transforms. For example, the following constant latitude pix- 
elisations of the sphere have been used extensively in astro- 
nomical applications and beyond: the equi-angular pixelisation 
( [Driscoll & Healy||1994| ); the Hierarchical Equal Area isoLati- 
tude Pixelisa 5oiQ(HEALPix) ([G orski et al| |2005t ; the IGLOC0 
pixelisati on ([Crittenden & Turok[|1998) ; and the GLESF0 pix- 
elisation ( [Doroshkevich et aL 2005 ). Furthermore, at present no 
data-sphere compression tool is available publicly. 

Motivated by the requirement for a data-sphere compres- 
sion algorithm defined on a constant latitude pixelisation of the 
sphere, and a publicly available tool to compress such data, we 
develop wavelet-based compression algorithms for data defined 
on the HEALPix pixelisation scheme and make our implemen- 
tation of these algorithms available publicly. We are driven pri- 
marily by the need to compress CMB data, hence the adoption of 
the HEALPix scheme (the pixelisation scheme used currently to 
store and distribute these data). Wavelet transforms are expected 
to perform well in the energy compression stage of the compres- 
sion algorithm, thus we adopt a Haar wavelet transform defined 
on the sphere f or this stage ( following a similar framework to 
that outlined by Barreiro et al. 2000 ). We could have chosen a 
filter bank based wavelet framework, such as those developed 
by |Starck et aL] ( |2006| ) and |Wiaux et al] ( |2008| ), however, for 
now, we adopt discrete Haar wavelets due to their simplicity and 
computational efficiency. 



http://healpix.jpl.nasa.gov/ 
2 http : //www . cita . utoronto . ca/~ crittend/pixel . html 



The remainder of this paper is organised as follows. In 
Sec. [2] we describe the compression algorithms developed, first 
discussing Haar wavelets on the sphere, before explaining the 
encoding adopted in our lossless and lossy compression algo- 
rithms. The performance of our compression algorithms is then 
evaluated in Sec. [3] We first examine compression performance 
for CMB data and study the implications of any errors on cosmo- 
logical inferences drawn from the data. We then examine com- 
pression performance for topographical data and environmental 
illumination maps. Concluding remarks are made in Sec. [4] 



2. Compression algorithms 

The wavelet-based compression algorithms that we develop to 
compress data- spheres consist of a number of stages. Firstly, 
a Haar wavelet transform is performed to reduce the entropy 
of the data, followed by quantisation and encoding stages. The 
resulting algorithm is lossless to numerical precision. We then 
develop a lossy compression algorithm by introducing an addi- 
tional thresholding stage, after the wavelet transform, in order to 
reduce the entropy of the data further. Allowing a small degrada- 
tion in the quality of decompressed data in this manner improves 
the compression ratios that may be attained. In this section we 
first discuss the Haar wavelet transform on the sphere that we 
adopt, before outlining the subsequent stages of the lossless and 
lossy compression algorithms. We make our SZIP program that 
implements these algorithms available publicly Furthermore, 
we also provide an SZIP user manual (McEwen & Eyers, 2010), 
which discusses installation, usage (including a description of 
all compression options and parameters), and examples. 



2.1. Haar wavelets on the sphere 

The description of wavelets on the sphere given h ere is based 
largely on the generic lifting scheme proposed by Schroder & 
Sweldens] ( |1995| ) and also on the specific definition of Haar 
wavelets on a HEALPix pixelised sphere proposed by |Barreiro 
et al. (2000). However, our discussion and definitions contain a 



numbe r of notable differences to those given by Barr eiro et al. 
( 2000) since we construct an orthonormal Haar basis on the 
sphere and describe this in a multiresolution setting. 

We begin by defining a nested h ierarchy of spaces as required 
for a multiresolution analysis (see Daubechies (1992) for a more 
detailed discussion of multiresolution analysis). Firstly, consider 
the approximation space V) on the sphere S 2 , which is a subset of 
the space of square integrable functions on the sphere, i.e. Vj c 
L 2 (S 2 ). One may think of Vj as the space of piecewise constant 
functions on the sphere, where the index j corresponds to the 
size of the piecewise constant regions. As the resolution index 
j increases, the size of the piecewise constant regions shrink, 
until in the limit we recover L 2 (S 2 ) as j — > oo. If the piecewise 
constants regions of S 2 are arranged hierarchically as j increases, 
then one can construct the nested hierarchy of approximation 
spaces 

Vi c V 2 c-..c V 7 cL 2 (S 2 ), (1) 



|http : // www . glesp . nbi . dk/ 



where coarser (finer) approximation spaces correspond to a 
lower (higher) resolution level j. For each space Vj we de- 
fine a basis with basis elements given by the scaling functions 
(pj r k e Vj, where the k index corresponds to a translation on the 
sphere. Now, let us define Wj to be the orthogonal complement 



http : //www . szip . org . uk 
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of Vj in Vj+i, where the inner product of two square integrable 
functions on the sphere /, g e L 2 (S 2 ) is defined by 

<f\8)= f f(oj)g*(aj) dQ , 
Js 2 

where cj = (6, (p) denotes spherical coordinates with colatitude 
6 e [0, 7r] and longitude (p e [0, 2n), * denotes complex con- 
jugation and d£l - sin#d#d<£ is the usual rotation invariant 
measure on the sphere. Wj essentially provides a space for the 
representation of the components of a function in Vj+i that can- 
not be represented in Vj, i.e. Vj+\ = Vj © Wj. For each space 
Wj we define a basis with basis elements given by the wavelets 
i/fjjc e Wj. The wavelet space Wj encodes the difference (or de- 
tails) between two successive approximation spaces Vj and Vj+\. 
By expanding the hierarchy of approximation spaces, the highest 
level (finest) space j = J, can then be represented by the low- 
est level (coarsest) space j = 1 and the differences between the 
approximation spaces that are encoded by the wavelet spaces: 



j-i 



Vj = v 1 ®Q)w j 



(2) 



7=1 



Let us now relate the generic description of multiresolution 
spaces given above to the HEALPix pixelisation. The HEALPix 
scheme provides a hierarchical pixelisation of the sphere and 
hence may be used to define the nested hierarchy of approx- 
imation spaces explicitly. The piecewise constant regions of 
the function spaces Vj discussed above now correspond to the 
pixels of the HEALPix pixelisation at the resolution associ- 
ated with Vj. To make the association explicit, let Vj corre- 
spond to a HEALPix pixelised sphere with resolution parameter 
A^side = 2 7 " 1 (HEALPix data- spheres are represented by the reso- 
lution parameter N^q, which is related to the number of pixels in 
the pixelisation by N = l2N S i^ Q 2 ). In the HEALPix scheme, each 
pixel at level j is subdivided into four pixels at level j+ 1, and the 
nested hierarchy given by ([T]) is satisfied. The number of pixels 
associated with each space Vj is given by = 12 x 4 7 " 1 , where 
the area of each pixel is given by Aj = 4n/Nj = 7r/(3x4 7_1 ) (note 
that all pixels in a HEALPix data-sphere at resolution j have 
equal area). It is also useful to note that the number and area of 
pixels at one level relates to adjacent levels through Nj+i - 4Nj 
and Aj+i = Ay/4 respectively. 

We are now in a position to define the scaling functions and 
wavelets explicitly for the Haar basis on the nested hierarchy of 
HEALPix spheres. In this setting the index k corresponds to the 
position of pixels on the sphere, i.e. for Vj we get the range of 
values k = 0, • • • , Nj - 1 , and we let Pj^ represent the region of 
the kth pixel of a HEALPix data-sphere at resolution j. For the 
Haar basis, we define the scaling function 0^ at level j to be 
constant for pixel k and zero elsewhere: 



<f>jjc(w) = 



j <» G p j,k 
elsewhere . 



The non-zero value of the scaling function 1 / yAy is chosen to 
ensure that the scaling functions for k = 0, • • • , Nj ■ - 1 do 
indeed define an orthonormal basis for Vj. Before defining the 
wavelets explicitly, we fix some additional notation. Pixel Pj^ 
at level j is subdivided into four pixels at level j + 1 , which we 
label P j+ i,k , Pj+im, Pj+i,k 2 and p j+ite> as illustrated in Fig. [I] 
An orthonormal basis for the wavelet space Wj, the orthogonal 
complement of Vj, is then given by the following wavelets of 
type m = {0, 1,2}: 

i/f° 1k (o)) = [0;+i,*oM - 07+UiM + <Pj+i,k 2 {u>) ~ (f>j+i,k 3 (oj)]/2 ; 



= [07+Uo(^O + 07+UiM - - 07+U 3 (^)]/ 2 ' 

^lk(^ - - - <Pj+l,k 2 {u>) + 0y+l,jk 3 M]/2 . 

We require three independent wavelet types to construct a com- 
plete basis for Wj since the dimension of Vj+\ (given by Nj+\) 
is four times larger than the dimension of Vj (the approximation 
function provides the fourth component). The Haar scaling func- 
tions and wavelets defined on the sphere above are illustrated in 
Kg. [1] 

Let us check that the scaling functions and wavelets satisfy 
the requirements for an orthonormal multiresolution analysis as 
outlined previously. We require Wj to be orthogonal to Vj, i.e. 
we require 



f (f>j,k(aj)^ k ,(aj)d£l = 0. 
Js 2 



This is always satisfied since for k' ± k the scaling function and 
wavelet do not overlap and so the integrand is zero always, and 
for k! = k we find 

f (f>j,k(aj)if^ k (u)dQcc [ if,™(aj)dQ = 0. 
Js 2 Js 2 

We also require Wj to be orthogonal to Wj> for all j and /. Again, 
if the basis functions do not overlap (i.e. k ^ k') then this require- 
ment is satisfied automatically, and if they do (i.e. k = k') then 
the wavelet at the finer level / > j will always lie within a region 
of the wavelet at level j with constant value, and consequently 



0. 



Finally, to ensure that we have constructed an orthonormal 
wavelet basis for Wj, we check the orthogonality of all wavelets 
at level j: 



dmm'dkk' 



1 



2^/A 



7+1 



v j &mm' 1 



where for m ± rn' the positive and negative regions of the inte- 
grand cancel exactly and for k ± k f the wavelets do not overlap 
and so the integrand is zero always. Note that in the previous 
expression the final Aj term arises from the area element dD. 
The Haar approximation and wavelet spaces that we have con- 
structed therefore satisfy the requirements of a orthonormal mul- 
tiresolution analysis on the sphere. Although the orthogonal na- 
ture of these spaces is important, a different normalisation could 
be chosen. It is now possible to define the analysis and synthesis 
of a function on the sphere in this Haar wavelet multiresolution 
framework. 

The decomposition of a function defined on a HEALPix 
data-sphere at resolution /, i.e. fj e Vj, into its wavelet and 
scaling coefficients proceeds as follows. Consider an intermedi- 
ate level j+l < J and let fj+i be the approximation of fj in 
Vj+\. The scaling coefficients at the coarser level j are given by 
the projection of fj+\ onto the scaling functions (pjX- 

= I fj + i(oj)(pj,k(oj)dQ, 
Js 2 

= (Aj+ijq + Aj+i,ki + Aj+ifo + Aj+i,k 3 ) VaJ/^ ' 

where we call Aj^ the approximation coefficients since they de- 
fine the approximation function fj e Vj. At the finest level /, we 
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naturally associate the function values of fj with the approxima- 
tion coefficients of this level. The wavelet coefficients at level j 
are given by the projection of fj+i onto the wavelets ifsj k : 

^E^/ j+1 (^)dQ, 

giving 

y),k = Uj+Uo + Aj+iM - Aj+ito - Aj+ite) V^// 4 

and 

where we call yj k the detail coefficients of type m. Starting from 
the finest level /, we compute the approximation and detail co- 
efficients at level / - 1 as outlined above. We then repeat this 
procedure to decompose the approximation coefficients at level 
J - I (i.e. the approximation function fj-i), into approximation 
and detail coefficients at the coarser level J - 2. Repeating this 
procedure continually, we recover the multiresolution represen- 
tation of fj in terms of the coarsest level approximation f\ and 
all of the detail coefficients, as specified by ^ and illustrated in 
Fig. [2] In general it is not necessary to continue the multireso- 
lution decomposition down to the coarsest level j = I; one may 
choose to stop at the intermediate level Jq, where 1 < Jq < J. 

The function fjeVj may then be synthesised from its ap- 
proximation and detail coefficients. Due to the orthogonal nature 
of the Haar basis, the approximation coefficients at level j + 1 
may be reconstructed from the weighted expansion of the scaling 
function and wavelets at the coarser level j, where the weights 
are given by the approximation and detail coefficients respec- 
tively. Writing this expansion explicitly, the approximation co- 
efficients at level j + 1 are given in terms of the approximation 
and detail coefficients of the coarser level j: 







+ y\k 


+ y\ k )h 




-*/'+Ui 




+ y\ k 








= (hk + y\ k 




-y%)i* 










+y%v- 





Repeating this procedure from level j = Jq up to j = J, one finds 
that the signal fjtVj may be written 

Njq-1 J-lNj-1 2 

k=0 j=J k=0 ra=0 

2.2. Lossless compression 

The Haar wavelet transform on the sphere defined in the pre- 
vious section is used as the first stage of the lossless compres- 
sion algorithm. The purpose of this stage is to compress the en- 
ergy content of the original data. In order to recover the original 
data from its compressed representation, the energy compres- 
sion stage must be reversible. This requirement limits candidate 
wavelet transforms on the sphere to those that allow the exact re- 
construction of a signal from its wavelet coefficients. We choose 



the Haar wavelet transform on the sphere since it satisfies this re- 
quirement and also because of its simplicity and computational 
efficiency. 

We demonstrate the energy compression achieved by the 
Haar wavelet transform with an example. In Fig.[3ja) we show a 
histogram of the value of each datum contained in a data-sphere 
that we wish to compress (although the particular data-sphere 
examined here is not of considerable importance, for the pur- 
pose of this demonstration we use the CMB data described in 
Sec. |3.1| ). In Fig. [3jb) we show a histogram of the value of the 
Haar approximation and detail coefficients for the same data- 
sphere. Notice how the wavelet transform has compressed the 
energy of the signal so that it is contained predominantly within a 
smaller range of values. Entropy provides a measure of the infor- 
mation content of a signal and is defined by H = - £i P t log 2 Pi, 
where P t is the probability of symbol i occurring in the data. By 
compressing the energy of the data so that certain symbols will 
have higher probability, we reduce its entropy. The aforemen- 
tioned entropy value also provides a theoretical limit on the best 
compression of data attainable with entropy encoding, hence by 
reducing the entropy of data the performance of any subsequent 
compression is improved. 

Following the wavelet transform stage of the compression 
algorithm, we perform an entropy encoding stage to compress 
the data. Entropy encoding is a type of variable length encoding, 
where symbols that occur frequently are given short codewords. 
The entropy H of the data gives the mean number of bits per 
datum required to encode the data using an ideal variable length 
entropy code. We adopt Huffman encoding, which produces a 
code that closely approximates the performance of the ideal en- 
tropy code. 

A compression algorithm consisting of the wavelet transform 
and entropy encoding stages described above would work, how- 
ever its performance would be limited. Although the wavelet 
transform compresses the energy of the data, coefficient values 
that are extremely close to one another may take distinct ma- 
chine representations. In order to achieve good compression ra- 
tios, one requires a compressed energy representation of the data 
with a relatively small number of unique symbols. To satisfy this 
requirement we introduce a quantisation stage in our compres- 
sion algorithm before the Huffman encoding. By quantising we 
map similar coefficient values to the same machine representa- 
tion, thus reducing the number of unique symbols contained in 
the data. The quantisation stage does introduce some distortion 
and so the resulting compression algorithm is no longer perfectly 
lossless, but is lossless only to a user specified numerical preci- 
sion. As one increases the precision parameter, lossless compres- 
sion is achieved in the limit. The user specified precision param- 
eter p defines the number of significant figures to retain in the 
wavelet detail coefficients (approximation coefficients are kept 
to the full number of significant figures provided by the machine 
representation). The precision parameter trades-off decompres- 
sion fidelity with compression ratio. The effect of quantisation 
on compression performance is evaluated in Sec. [3] 

For this lossless compression algorithm, data are decom- 
pressed simply by decoding the Huffman encoding of the 
wavelet coefficients, followed by application of the inverse Haar 
wavelet transform on the sphere. 

2.3. Lossy compression 

If we allow degradation to the quality of the decompressed data 
it is possible to achieve higher compression ratios. In this section 
we describe a lossy compression algorithm that trades-off losses 
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Level j + 1 




^IM ^ 2 ,*M 



Fig. 1. Haar scaling function and wavelets if/J^co). Dark shaded regions correspond to negative constant values, light shaded 

regions correspond to positive constant values and unshaded regions correspond to zero. The scaling function and wavelets at level 
j and position k are non-zero on pixel Pj^ only. Pixel Pj^ at level j is subdivided into four pixels at level j + 1, which we label 
Pj+ito, Pj+iM, Pj+ite and P j+iM as defined above. 



in decompression fidelity against compression ratio in a natural 
manner. 

The Haar wavelet representation of a data- sphere decom- 
poses the data into an approximation sphere and detail coef- 
ficients that encode the differences between the approximation 
sphere and the original data-sphere. Many of these detail coef- 
ficients are often close to zero (as illustrated by the histogram 
shown in Fig. [3jb)). If we discard those detail coefficients that 
are near zero, by essentially setting their value to zero, then we 
lose only a small amount of accuracy in the representation of the 
original data but reduce the entropy considerably. By increas- 
ing the proportion of detail coefficients neglected, we improve 
the compression ratio of the compressed data while reducing its 
fidelity in a natural manner. 

Our lossy compression algo rithm is identical to the lossless 
algorithm described in Sec. 22_ but with two additional stages in- 
cluded. Firstly, we introduce a thresholding stage after the quan- 
tisation and before the Huffman encoding stage. The threshold 
level is determined by choosing the proportion of detail coeffi- 
cients to retain. We treat the detail coefficients on all levels iden- 
tically. More sophisticated thresholding algorithms could treat 
the detail coefficients on each level j differently, perhaps us- 
ing an annealing scheme to specify the proportion of detail co- 
efficients to retain at each level. However, we demonstrate in 



Sec. |3.2| that the naive thresholding scheme outlined here per- 
forms very well in practice and so we do not investigate more 
sophisticated strategies. Once the threshold level is determined 
we perform hard thresholding so that all detail coefficients be- 
low this value are set to zero, while coefficients above the thresh- 
old remain unaltered. The thresholding stage reduces the number 
of unique symbols in the data by replacing many unique values 



with zero, hence reducing the entropy of the data and enabling 
greater compression. Furthermore, since many of the data are 
now zero, it is worthwhile to incorporate a run length encoding 
(RLE) stage so that long runs of zeros are encoded efficiently. 
The RLE stage is included after the thresholding and before the 
Huffman encoding stage. RLE introduces some additional en- 
coding overhead, thus it only improves the compression ratio for 
cases where there are sufficiently long runs of zeros. In Sec. 3.2 
we evaluate the performance of the lossy compression algorithm 
described here and examine the trade-off between compression 
ratio and fidelity. Moreover, we also examine cases where the ad- 
ditional overhead due to RLE acts to increase the compression 
ratio. 

For this lossy compression algorithm, data are decompressed 
simply by decoding the RLE and Huffman encoding of the 
wavelet coefficients, followed by application of the inverse Haar 
wavelet transform on the sphere. 



3. Applications 

In this section we evaluate the performance of the lossless and 
lossy compression algorithms on data-spheres that arise in a 
range of applications. The trade-off between compression ratio 
and the fidelity of the decompressed data is examined in detail. 
We begin by considering applications where lossless compres- 
sion is required, before then considering applications when lossy 
compression is appropriate. 
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Level J 




Legend 
O Original 
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Fig. 2. Haar multiresolution decomposition. Starting at the finest level / (the original data-sphere), the approximation and detail 
coefficients at level / - 1 are computed. This procedure is repeated to decompose the approximation coefficients at level J -I (i.e. 
the approximation function fj-i), into approximation and detail coefficients at the coarser level J - 2. Repeating this procedure 
continually, one recovers the multiresolution representation of fj in terms of the coarsest level approximation f Jo and all of the 
detail coefficients. 



3.1. Lossless compression 

The algorithms developed here to compress data defined on the 
sphere were driven primarily by the need to compress large CMB 
data-spheres. CMB data are used to study cosmological models 
of the Universe. Any errors introduced in the data may alter cos- 
mological inferences drawn from it, hence the introduction of 
large errors in the compression of CMB data will not be toler- 
ated. The lossless compression algorithm is therefore required 
for this application. Our lossless compression algorithm is loss- 
less only to a user specified numerical precision (as described in 
detail in Sec. |2.2| ). It is therefore important to ascertain whether 
the small quantisation errors that are introduced by this limited 
precision could affect cosmological inferences drawn from the 
data. We first evaluate the performance of the lossless compres- 
sion of CMB data, before investigating the impact of errors on 
the cosmological information content of the data. 

To evaluate our lossless compression algorithm we use sim- 
ulated CMB data. In the simplest inflationary models of the 
Universe, the CMB is fully described by its angular power spec- 
trum. Using the theoretical angular power spectrum that best 
fits the three-year WMAP observations (i.e. the power spectrum 
defined by the cosmological parameters specified in Table 2 of 
Spergel et al. (2007)), we simulate a Gaussian realisation of the 
CMB temperature anisotropics. Foreground emissions contami- 
nate real observations of the CMB, hence we also consider simu- 



lated maps where a mask is applied to remove regions of known 
Galactic and point source contamination. We apply the conser- 
vative KpO mask associated with the three-year WMAP obser- 
vations (Hinsha w et al.| |2007). These simulated CMB data, with 
and without application of the KpO mask, are illustrated at res- 
olutions A^ S i de = 512 (/ = 9; iV - 3 x 10 6 ) and Af side = 1024 
(/ = 10; N 13 x 10 6 ) in the first column of panels of Fig. 

The simulated CMB data are compressed using the lossless 
compression algorithm for a range of precision values p. RLE is 
applied when compressing the masked data to efficiently com- 
press the runs of zeros associated with the masked regions but 
not for the unmasked data (since the encoding overhead does 
not make it worthwhile). For each precision value, we compute 
the compression ratio achieved and the relative error between the 
decompressed data and the original data. The compression ratio 
is defined by the ratio of the size of the compressed data (in- 
cluding the Huffman encoding table) relative to the size of the 
original data, expressed as a percentage. The error used to eval- 
uate the fidelity of the compressed data is defined by the ratio of 
the mean- square-error (MSE) between the original and decom- 
pressed data-spheres relative to the root-mean-squared (RMS) 
value of the original data-sphere, expressed as a percentage. 
These values are plotted for a range of precision values in the 
third column of panels of Fig. [4] In the second column of panels 
of Fig. [4] residual errors between the original data and decom- 
pressed data reconstructed for a precision parameter p = 3 are 
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Fig. 3. Histograms of original data and wavelet (approxima- 
tion and detail) coefficient values. Although the particular data- 
sphere considered here is not important for the purpose of this 
demonstration, thes e histograms correspond to the CMB data de- 
scribed in Sec. 13.11 Notice how the Haar wavelet transform has 
compressed the energy content of the signal, thereby reducing 
its entropy and allowing for greater compression performance. 



shown, where a compression ratio of 18% is achieved for resolu- 
tion A^ S ide = 1024. Although the precision level p = 3 introduces 
some error in the reconstructed data (2.7% for A^e = 1024), 
the error on each pixel is reassuringly small at typically a fac- 
tor of ~ 100 smaller than the corresponding data value. For the 
precision parameter p = 5, a compression ratio of 40% and an 
error of 0.03% is obtained for resolution A^e = 1024. The er- 
ror introduced by the compression for this case is sufficiently 
small that one might hope that no significant cosmological in- 
formation content is lost in the compressed data. We investigate 
the cosmological information content of the compressed data in 
detail next. 

In the simplest inflationary scenarios the cosmological in- 
formation content of the CMB is contained fully in its angular 
power spectrum. Although the angular power spectrum does not 
contain all cosmological information in non-standard inflation- 
ary settings, anisotropic models of the Universe or various cos- 
mic defect scenarios, we nevertheless use it as a figure of merit to 
determine any errors in the cosmological information content of 
CMB data. To evaluate any loss to the cosmological information 
content of compressed CMB data, we examine the errors that are 
induced in the angular power spectrum of the compressed data. 
We consider here unmasked CMB data only, which simplifies 
the estimation of the angular power spectrum. Before proceed- 
ing, we briefly define the angular power spectrum of the CMB 
and the estimator that we use to compute the power spectrum 
from CMB data. The angular power spectrum Q is given by the 
variance of the spherical harmonic coefficients of the CMB, i.e. 



(a im a* im ) = Cidu'&mm', where Sij is the Kronecker delta symbol 
and the spherical harmonic coefficients ai m are given by the pro- 
jection of the CMB anisotropies AT (6, (p) onto the spherical har- 
monic functions Y{; m (6, cp) through cii m — (AT\Y £m ). If the CMB 
is assumed to be isotropic, then for a given £ the m-modes of 
the spherical harmonic coefficients are independent and iden- 
tically distributed. The underlying Q spectrum may therefore 
be estimated by the quantity Q = Yf m =-t \ a £m\l(2£ +1). Since 
more m-modes are available at higher I, the error on this estima- 
tor reduces as £ increases. This phenomenon is termed cosmic 
variance and arises since we may observe one realisation of the 
CMB only. Cosmic variance is given by (AQ) 2 = 2C 2 /(2£ + 1) 
and provides a natural uncertainty level for power spectrum es- 
timates made from CMB data. Any errors introduced in the an- 
gular power spectrum of compressed CMB data may therefore 
be related to cosmic variance to determine the cosmological im- 
plication of these errors. In Fig. [5] we show the angular power 
spectrum computed for the original and compressed CMB data 
for A^ S ide = 1024, and errors between these spectra, for a range 
of precision parameter values. In the first two columns of panels 
we also highlight the three standard deviation confidence inter- 
val due to cosmic variance. For the precision parameter p = 5, 
we find that essentially no cosmological information content is 
lost in the compressed data. Even for large values of £, for which 
cosmic variance is very small, the error in the recovered power 
spectrum relative to cosmic variance is of the order of a few per- 
cent only. For the case p = 4, still only minimal cosmological 
information content is lost, while for the case p = 3 we begin 
to see a moderate loss of cosmological information. Obviously 
the degree of cosmological information content loss that may be 
tolerated depends on the application at hand. However, we have 
demonstrated that it is possible to compress CMB data to 40% of 
its original size while ensuring that essentially no cosmological 
information content is lost (corresponding to p = 5). If one tol- 
erates a moderate loss of cosmological information then the data 
may be compressed to 18% of its original size (corresponding to 
P = 3). 

Finally, we measure the CPU time required to compress and 
decompress CMB maps. All timing tests are performed on a lap- 
top with a 2.66GHz Intel Core 2 Duo processor and 4GiB of 
RAM. We restrict our attention to unmasked data and to pre- 
cision parameters p = 3 and p = 5 only. Computation times 
are plotted in Fig. [6] for a range of resolutions, where all mea- 
surements are averaged over five random Gaussian CMB sim- 
ulations. Note that computation time increases with precision 
parameter p since the number of unique wavelet coefficients re- 
quiring encoding also increases with p. All stages of our com- 
pression and decompression algorithms are linear in the number 
of data samples, hence the computation time of our algorithms 
scales linearly with the number of samples on the sphere, i.e. as 
0(Af S ide 2 ), as a l so apparent from Fig. [6] 



3.2. Lossy compression 

In certain applications the loss of a small amount of informa- 
tion from a data-sphere is not catastrophic. For example, in com- 
puter graphics environmental illumination maps and reflectance 
functions that are defined on the sphere are used in render- 
ing synthetic images (e.g. Rama moorthi & Hanrahan 2004]). In 
this application accuracy is determined by human perception, 
hence errors may be tolerated if they are not suitably noticeable. 
Moreover, data- spheres that are input to reflectance algorithms 
are not viewed directly, thus moderate errors in these data may 
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(a) CMB at Af side = 512 (13MB) 




------ ----- Precision p 

(b) Residual for p = 3 at Af side = 512 (2.5MB) (c) Performance at Af side = 512 



(d) Masked CMB at Af side = 512 (13MB) 




(e) Masked residual for p = 3 at iV side = 512 (f) Masked performance at A^ side = 512 
(2.1MB) 



^^^^^^^^ 



(g) CMB at Af side = 1024 (50MB) 






(h) Residual for p = 3 at Af side = 1024 (9.1MB) (i) Performance at Af side = 1024 





(j) Masked CMB at N side = 1024 (50MB) 





(k) Masked residual for p = 3 at iV side = 1024 (1) Masked performance at 
(7.6MB) Af side = 1024 



Fig. 4. Lossless compression of simulated Gaussian CMB data, with and without application of the KpO mask (data-spheres are 
displayed using the Mollweide projection). The first column of panels shows the original simulated CMB data, with corresponding 
file sizes also specified. The second column of panels shows the residual of the original and decompressed CMB data reconstructed 
for a precision parameter of p = 3, with corresponding file sizes of the compressed data also specified. Note that the colour scale 
between the first and second column of panels is scaled by a factor of 100. RLE is applied when compressing the masked data to 
efficiently compress the runs of zeros associated with the masked regions. The third column of panels shows the trade-off between 
compression ratio and decompression fidelity with precision parameter p. Compression ratio (solid black line; left axis) is defined 
by the ratio of the compressed file size relative to the original file size, expressed as a percentage. The decompression error (dashed 
red line; right axis) is defined by the ratio of the MSE between the original and decompressed data relative to the RMS value of the 
original data, expressed as a percentage. 



not necessarily produce noticeable errors in rendered images. 
Lossy representations of data-spheres in computer graphics are 
therefore not only tolerated, but are often desired as they may 
improve the computational efficiency of rendering algorithms 
(e.g. Ng et al. 2004). For data compression purposes, our lossy 



compression algorithm is certainly appropriate and may be ap- 
plied in order to achieve higher compression ratios. In addition 
to the environmental illumination maps discussed previously, we 
also compress Earth topography data and evaluate the perfor- 
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(d) Angular power spectra for p = 4 



(e) Absolute error for p = 4 



(f) Error relative to cosmic variance for p = 4 




Fig. 5. Reconstructed angular power spectrum of compressed CMB data at resolution A^e = 1024. Each row of panels shows 
the reconstructed power spectrum and errors for a particular compression precision parameter p. In the first column of panels, the 
power spectrum reconstructed from the original CMB data are given by the red dots, the power spectrum reconstructed from the 
compressed CMB data are given by the blue dots and the underlying power spectrum of the simulated model is shown by the 
solid black line, with three standard deviation cosmic variance regions shaded in yellow. Note that in some instances the red and 
blue dots align closely and may not both be visible. In the second column of panels, the absolute error between the power spectra 
reconstructed from the original and compressed CMB data is given by the blue dots, with three standard deviation cosmic variance 
regions shaded in yellow. In the third column of panels, the absolute error between the power spectra reconstructed from the original 
and compressed CMB data is expressed as a percentage of cosmic variance. Note that the scale of the vertical axis changes by an 
order of magnitude between each row of the third column of panels. 



mance of both our lossless and lossy compression algorithms on 
both of these types of data. 

The Earth topography and environmental illumination data- 
spheres considered here are all obtained from real-world obser- 
vations. The original data are illustrated in the first column of 
panels in Fig. [7] The topographical data are represented at a 
HEALPix resolution of N s ^q = 512. The environmental illumi- 
nation data-spheres were constructed by Debevec (1998) and are 
available publicl)^] These data defined on the sphere were con- 



structed by taking two photographic images of a mirrored ball 
from different locations, and mapping the observed intensities of 
the images onto the surface of a sphere. The illumination maps 
are available in a cross-cube format and have been converted 
to a HEALPix data- sphere at resolution A^e = 256 (/ = 8; 
N ^ 0.8 x 10 6 ). We consider environmental illumination data 
that was captured in this manner from within Galileo's Tomb 
in Santa Croce, Florence, St. Peter's Basilica in Rome and the 
Uffizi Gallery in Florence. 



http : //www . debevec . org/Probes/ 
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Fig. 6. Computation time required to compress (blue/dashed 
line) and decompress (green/dot-dashed line) simulated 
Gaussian CMB data of various resolution N s ^ e . Computation 
times are averaged over five simulated Gaussian CMB maps and 
are shown for precisions parameters p = 3 (squares) and p = 5 
(triangles). 0(N S {^ 2 ) scaling is shown by the heavy black/solid 
line. 

Lossless and lossy compressed versions of the data are illus- 
trated in Fig. [7] For the lossless compression we use a precision 
parameter of p = 3 in the quantisation stage of the compression 
since ultimately we are concerned with achieving a high com- 
pression ratio and will allow some quantisation error. For the 
lossy compression we again use a precision parameter of p = 3 
and retain only 5% of the detail coefficients in the thresholding 
stage of the compression algorithm. Compression ratios of ap- 
proximately 40:1 are achieved for the lossy compression of both 
the topographic and environmental illumination data. Although 
it is possible to discern errors in the lossy compressed data, the 
overall structure and many of the details of the original data are 
well approximated in this highly compressed representation. 

In Fig. [8] we evaluate the performance of the compression of 
these data more thoroughly. Firstly, for the lossless compression 
we examine the trade-off between compression ratio and decom- 
pression error with respect to the precision parameter p (see the 
first column of panels of Fig. [HJ. For both the topographic and 
illumination data it is apparent that we may reduce the preci- 
sion parameter to p = 3, while introducing quantisation error 
on the order of a few percent only. If the precision parameter is 
reduced to p = 2, quantisation errors on the order of 10-20% 
appear. We therefore choose p = 3 for the lossless compression 
since this maximises the compression ratio while introducing an 
allowable level of quantisation error. We then examine the effect 
of increasing the threshold level, by reducing the proportion of 
detail coefficients retained, on the performance of the lossy com- 
pression (see the second column of panels of Fig. [8]). Retaining 
100% of the detail coefficients corresponds to the lossless com- 
pression case, where no RLE is included. RLE is included for 
all other lossy compression cases. Notice that when retaining 
50% of the detail coefficients, the resulting improvement to the 
compression ratio is offset by the additional encoding overhead 
of the RLE. Consequently, the compression ratio when retain- 
ing 50% of coefficients (with RLE) is often worse than when 
retaining 100% of coefficients (without RLE). As the proportion 
of detail coefficients that are retained is reduced, the improve- 
ment to compression ratio quickly exceeds the additional over- 
head of RLE. Decompression error remains at approximately 5% 
when retaining only 5% of the detail coefficients, but increases 
quickly as the proportion of coefficients retained is reduced fur- 
ther. Retaining 5% of coefficients therefore appears to give a 



good trade-off between compression ratio and fidelity of the de- 
compressed data, justifying this choice for the results presented 
in Fig. [7] For this choice, the topographic and illumination data 
are compressed to a ratio of approximately 40:1, while introduc- 
ing errors of approximately 5%. The images illustrated in Fig. [7] 
show that errors of this order are not significantly noticeable and 
are likely to be acceptable for many applications. Also notice 
that the curves shown in Fig. [8] for the environmental illumina- 
tion data are similar, indicating that the characteristics of natural 
illumination are to some extent independent of the scene. One 
would therefore expect the compression performance observed 
for the data- spheres considered here to be typical of environ- 
mental illumination data in general. Although we have made a 
number of arbitrary choices here regarding acceptable levels of 
distortion in the decompressed data, one may of course choose 
the number of detail coefficients retained that provides a trade- 
off between compression ratio and fidelity that is suitable for the 
application at hand. 

4. Concluding remarks 

We have developed algorithms to preform lossless and lossy 
compression of data defined on the sphere. These algorithms 
adopt a Haar wavelet transform on the sphere to compress the 
energy content of the data, prior to quantisation and Huffman en- 
coding stages. Note that the resulting lossless compression algo- 
rithm is lossless to a user specified numerical precision only. The 
lossy compression algorithm incorporates, in addition, a thresh- 
olding stage so that only a user specified proportion of detail 
coefficients are retained, and a RLE stage. By allowing a small 
degradation to the fidelity of the compressed data in this manner, 
significantly greater compression ratios can be attained. 

The performance of these compression algorithms has been 
evaluated on a number of data-spheres and the trade-off between 
compression ratio and the fidelity of the decompressed data has 
been examined thoroughly. Firstly, the lossless compression of 
CMB data was performed and it was demonstrated that the data 
can be compressed to 40% of their original size, while ensuring 
that essentially none of the cosmological information content 
of the data is lost. A compression ratio of approximately 20% 
can be achieved if a small loss of cosmological information is 
tolerated. Secondly, the lossy compression of Earth topography 
and environmental illumination data was performed. For both of 
these data types compression ratios of approximately 40:1 can 
be achieved, while introducing relative errors of approximately 
5%. By taking account of the geometry of the sphere that these 
data live on, we achieve superior compression performance than 
naively applying standard compression algorithms to the data 
(such as a JPEG compression of the six planes of a cross-cube 
representation of a data-sphere, for example). On inspection of 
the decompressed data, it is possible to discern errors in the re- 
covered data by eye, nevertheless the overall structure and many 
of the details of the data are well approximated. The accuracy of 
the compressed data remains suitable for many applications. 

A number of avenues exist to improve the performance of 
the current compression algorithms. We choose Haar wavelets 
on the sphere for the energy compression stage due to their sim- 
plicity and computational efficiency. Howe ver, the scale discre- 
tised wavelet methodology developed by Wia ux et al.| (2008) 
may yield better compression performance due to the ability 
to represent directional structure in the original data efficiently. 
However, this wavelet transform is computed in spherical har- 
monic space; forward and inverse spherical harmonic trans- 
forms are not exact on a HEALPix pixelisation. Consequently, 
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(d) Galileo: original (3.2MB) 



(e) Galileo: lossless compressed (0.21MB) (f) Galileo: lossy compressed (0.07MB) 





(g) St Peter's: original (3.2MB) 



(h) St Peter's: 
(0.20MB) 



lossless compressed (i) St Peter's: lossy compressed (0.08MB) 




(j) Uffizi: original (3.2MB) 



(k) Uffizi: lossless compressed (0.19MB) 



(1) Uffizi: lossy compressed (0.10MB) 



Fig. 7. Compressed data for lossy compression applications (data-spheres are displayed using the Mollweide projection). Each row 
of panels shows the original, lossless and lossy compressed data-spheres. File sizes for each data-sphere are also specified. The 
lossless compressed data shown in the second column of panels is performed with a precision parameter of p = 3. The lossy 
compressed data shown in the third column of panels is performed by retaining 5% of detail coefficients only and including a RLE 
stage. The full dynamic range of these images may not be visible in printed versions of this figure, hence this figure is best viewed 
online. 



greater errors will be introduced in any compression strategy 
based on this transform. For alternative constant latitude pixeli- 
sations of the sphere, however, exact (and fast) sphe rical har- 
monic transfo rms do exist and could be adopted (McEw en & 
Wiaux, 201 1 ). Nevertheless, the implementation of the scale dis- 
cretised wavelet is also considerably more demanding compu- 
tationally. Scope also remains to improve the lossy compres- 
sion algorithm by treating the detail coefficients at each level 
differently, perhaps by using an annealing scheme to dynami- 
cally specify the proportion of detail coefficients to retain at each 
level. Nevertheless, the naive thresholding strategy adopted cur- 
rently has been demonstrated to perform very well. 

The algorithms that we have developed to compress data 
defined on the sphere have been demonstrated to perform well 
and we hope that our publicly available implementation will 



now find practical use. Obviously these compression algorithms 
may be used to reduce the storage and dissemination costs of 
data defined on the sphere, but the compressed representation 
of data- spheres may also find use in the develo pment of fast al - 
gorithms that exploit this representation (e.g. |Ng et al.||2004| ). 
Furthermore, data defined on other two-dimensional manifolds 
may be compressed by first mapping these data to a sphere, 
before applying our data-sphere compression algorithms. We 
are currently pursuing this idea for the compression of three- 
dimensional meshes used to represent computer graphics mod- 
els. 



11 



J. D. McEwen, Y. Wiaux and D. M. Eyers: Data compression on the sphere 






Fig. 8. Compression performance for lossy compression applications. Each row of panel shows performance plots for various data- 
spheres. The first column of panels shows the trade-off between compression ratio and decompression fidelity with precision pa- 
rameter p for lossless compression. The second column of panels shows the same trade-off, but with respect to the number of detail 
coefficients retained in the lossy compression. A precision parameter of p = 3 is used for all lossy compression results illustrated 
here. Compression ratio (solid black line; left axis) and decompression error (dashed red line; right axis) are defined in the caption 

of Fig. g 
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